
fix #173. When wandb detected pass to Trainer(...,devices=1,...) othe… #174

Merged
rvandewater merged 1 commit into rvandewater:development from rustamzh:fix-single-gpu-setup on Oct 22, 2025

Conversation

@rustamzh
Contributor

@rustamzh rustamzh commented Oct 22, 2025

When wandb is detected, pass devices=1 to Trainer(...); otherwise wandb creates multiple run folders and crashes. Fixes #173.

Summary by CodeRabbit

  • Chores
  • GPU device allocation for training is now managed based on runtime integration configuration: when Weights & Biases is active for experiment tracking, training is restricted to a single GPU; otherwise training uses all available GPUs for parallel execution.

@coderabbitai

coderabbitai Bot commented Oct 22, 2025

Walkthrough

Modified GPU device management in the training module to conditionally restrict to a single GPU when using wandb. Addresses multi-GPU sweep crashes by setting devices = 1 when wandb is enabled, falling back to multi-GPU usage otherwise.

Changes

Cohort / File(s): GPU device management (icu_benchmarks/models/train.py)
Summary: Added conditional logic to manage the GPU device count: initializes to max(cuda_device_count, 1), then restricts to devices = 1 when wandb is active to prevent multi-GPU sweep crashes. Updated the trainer initialization to use the managed devices variable.
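The device-selection logic summarized above can be sketched as follows. The helper name and its parameters are illustrative assumptions; the actual code in icu_benchmarks/models/train.py reads the CUDA device count and the wandb state directly rather than through such a function:

```python
# Sketch of the device-selection logic described in the summary above.
# The function and its parameters are hypothetical, introduced here only
# to illustrate the decision the PR makes.

def select_devices(cuda_device_count: int, wandb_active: bool) -> int:
    """Return the value to pass as Trainer(devices=...)."""
    # Use every visible CUDA device, but at least 1 so that CPU-only
    # machines still get a valid devices argument.
    devices = max(cuda_device_count, 1)
    if wandb_active:
        # wandb sweeps create one run folder per training process and
        # crash on multi-GPU machines, so restrict to a single device.
        devices = 1
    return devices
```

A caller would then do something along the lines of `Trainer(devices=select_devices(torch.cuda.device_count(), use_wandb), ...)`, which is the shape of the change this PR makes.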

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

🐰 When sweeping through the GPU garden, one runs faster than many—
A wandb waltz needs focus, so we set the device to one, not any.
Multi-GPU melted, now single GPU glows bright, 🎆
The trainer hops smoothly through the PyTorch night! ✨

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title "fix #173. When wandb detected pass to Trainer(...,devices=1,...) othe…" is directly related to the main change in the changeset. The title clearly references the specific fix being implemented: when wandb is detected, the Trainer is invoked with devices=1 to prevent multi-GPU issues. This aligns with the raw summary which shows the introduction of a devices variable that restricts to a single GPU when wandb is used. The title is concise, specific about the change, and references the linked issue.
Linked Issues Check ✅ Passed The code changes directly address the primary objective of issue #173, which is to prevent wandb from creating multiple folders and crashing on multi-GPU systems by restricting execution to a single GPU. The implementation introduces a devices variable that sets devices to 1 when wandb is detected, which corresponds precisely to the first proposed solution mentioned in the issue. The change modifies the Trainer initialization to use this variable, ensuring that when wandb is active, the training runs with a single GPU device as required.
Out of Scope Changes Check ✅ Passed The changes are focused and in-scope, consisting solely of modifications to icu_benchmarks/models/train.py related to the devices variable management logic for handling wandb usage. The raw summary indicates no alterations to exported or public entities, and all changes directly support the objective of fixing the wandb multi-GPU crash issue by restricting to a single GPU. No unrelated code modifications or out-of-scope changes are apparent in the changeset.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
icu_benchmarks/models/train.py (1)

157-158: Pre-existing bug: Condition always evaluates to True.

This is not part of your changes, but there's a logic error here. The string "16-mixed" is always truthy, so this condition always executes regardless of the precision value.

Should be:

-    if precision == 16 or "16-mixed":
+    if precision == 16 or precision == "16-mixed":

Note: This bug may have minimal impact since setting matmul precision to "medium" is often desirable anyway, but the logic should still be corrected for clarity and correctness.
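The always-truthy condition can be reproduced with a standalone snippet (a minimal illustration, not the project's code):

```python
# Reproduces the bug flagged above: Python parses
# `precision == 16 or "16-mixed"` as `(precision == 16) or ("16-mixed")`,
# and a non-empty string literal is always truthy.

def buggy_check(precision) -> bool:
    return bool(precision == 16 or "16-mixed")

def fixed_check(precision) -> bool:
    # Compare precision against each accepted value explicitly.
    return precision == 16 or precision == "16-mixed"

print(buggy_check(32))          # True, even though 32 matches neither value
print(fixed_check(32))          # False
print(fixed_check("16-mixed"))  # True
```

An idiomatic alternative to the chained comparison is `precision in (16, "16-mixed")`.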

🧹 Nitpick comments (1)
icu_benchmarks/models/train.py (1)

145-148: Optional: Consider clarifying the wandb logging message.

The device restriction correctly prevents wandb from creating multiple folders on multi-GPU systems. The implementation is sound.

The log message is accurate but could be more specific:

-        logging.info("Use of wandb is detected. Only single gpu training is supported with wandb.")
+        logging.info("Wandb detected. Restricting training to a single device to prevent sweep crashes.")

This makes the workaround nature explicit and uses "device" instead of "gpu" for accuracy across hardware types.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3c25cb7 and 96b8cbd.

📒 Files selected for processing (1)
  • icu_benchmarks/models/train.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
icu_benchmarks/models/train.py (1)
icu_benchmarks/imputation/simple_diffusion.py (1)
  • on_fit_start (117-126)
🔇 Additional comments (2)
icu_benchmarks/models/train.py (2)

143-144: Device initialization logic looks good.

The initialization correctly defaults to at least 1 device and uses the available CUDA device count when GPUs are present. This provides a sensible baseline before the wandb-specific override.


166-166: Correct usage of the computed devices variable.

The dynamic device assignment properly integrates with the wandb-aware device management logic introduced earlier.

@rvandewater rvandewater merged commit 32760d4 into rvandewater:development Oct 22, 2025
2 checks passed


Development

Successfully merging this pull request may close these issues.

When running on multi-gpu systems the sweep crashes

2 participants